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We develop a new principal components analysis (PCA) type di- 
mension reduction method for binary data. Different from the stan- 
dard PCA which is defined on the observed data, the proposed PCA 
is defined on the logit transform of the success probabilities of the bi- 
nary observations. Sparsity is introduced to the principal component 
Qh ' (PC) loading vectors for enhanced interpretability and more stable 

' extraction of the principal components. Our sparse PCA is formu- 

lated as solving an optimization problem with a criterion function 
' motivated from a penalized Bernoulli likelihood. A Majorization- 

^ Minimization algorithm is developed to efficiently solve the optimiza- 

tion problem. The effectiveness of the proposed sparse logistic PCA 
method is illustrated by application to a single nucleotide polymor- 
^ , phism data set and a simulation study. 

\ 1. Introduction. Principal components analysis (PCA) is a widely used 

Cf^ ■ method for dimensionality reduction, feature extraction and visualization of 

multivariate data. Several sparse PCA methods have recently been intro- 
duced to improve the standard PCA [e.g., Jolliffe, Trendafilov and Uddine 
(2003); Zou, Hastie and Tibshirani (2006); Shen and Huang (2008)]. By re- 
quiring the principal component loading vectors to be sparse, sparse PCA 
methods yield PCs that are more easily interpretable. Sparsity also regu- 
^ ' larizes the extraction of PCs and thus makes the extraction more stable. 
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Such stability is much desired when the dimension is high, especially in 
the so-called high-dimension low-sample-size settings. As extensions of the 
standard PCA, however, these sparse PCA methods are mostly suitable to 
variables of a continuous type, they are not generally appropriate for other 
data types such as binary data or counts. Although the basic objective of 
PCA, or its sparse version, can be achieved regardless of the nature of the 
original variable, it is true that variances and covariances have especial rele- 
vance for multivariate Gaussian variables, and that linear functions of binary 
variables are less readily interpretable than linear functions of continuous 
variables [Jolliffe (2002)]. The goal of this paper is to develop a sparse PCA 
method for binary data. 

There are two commonly used definitions of PCA that give rise to the 
same result. PCA can be defined by finding the orthogonal projection of 
the data onto a low dimensional linear subspace such that the variance of 
the projected data is maximized [Hotelling (1933)]. Alternatively, PCA can 
also be defined by finding the linear projection that minimizes the mean 
squared distance between the data points and their projections [Pearson 
(1901)]. Shen and Huang (2008) developed their sparse PCA method follow- 
ing the viewpoint of Pearson. Suppose yi, . . . , y„ S M'^ are the n data points 
and consider a fc-dimensional {k < d) linear manifold spanned by a bases 
bi, . . . ,bfc with a shift vector /x. According to Pearson, the PCA minimizes 
the following reconstruction error, 

n 

(1-1) ^ ||yi - (At + fliibi hajfcbfc)]!^: 

i=l 

subject to the constraint that A = (aij) has orthonormal columns. Usually 
the variables presented in yj are scaled so that they have the same order of 
magnitude. Note that (1.1) is a least squares regression if Cj^'s were known. 
In light of this connection to regression and borrowing the idea from LASSO 
[Tibshirani (1996)], Shen and Huang (2008) proposed to add an Li penalty 
||bi||i H — • -|- ]|bfc||i to the reconstruction error (1.1) to obtain sparse loading 
vectors bi,...,bfc. Since the reconstruction error (1.1) can be viewed as 
the negative log likelihood up to a constant for the Gaussian distributions 
with mean vectors 9i = fj, + anhi + •••-!- aj^b/j for i = 1, . . . ,n and identity 
covariance, the method of Shen and Huang can be interpreted as a penalized 
likelihood approach for the sparse PCA. The key idea of the current paper is 
to replace the Gaussian likelihood by the Bernoulli likelihood where 6i will 
be the logit transform of the success probabilities. We refer to the proposed 
PCA method as sparse logistic PCA. The relationship of the proposed sparse 
logistic PCA to the sparse PCA of Shen and Huang is analogous to the 
relationship between logistic and linear LASSO regression. 
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We develop an iterative weighted least squares algorithm to perform 
the proposed sparse logistic PCA. Since the log Bernoulli likelihood is not 
quadratic and the Li penalty function is nondifferentiable, the optimization 
problem defining the sparse logistic PCA is not straightforward to solve. Our 
algorithm applies the general idea of optimization transfer or Majorization- 
Minimization (MM) algorithm [Lange, Hunter and Yang (2000); Hunter and 
Lange (2004)]. By iteratively replacing the complex objective function with 
suitably defined quadratic surrogates, each step of our algorithm solves a 
weighted least squares problem and has closed form. The algorithm is easy to 
implement and guaranteed at each iteration to improve the penalized PCA 
log- likelihood. We show that the same MM algorithm is applicable when 
there are missing data. We also develop a method for choosing the penalty 
parameters and for choosing the number of important principal components. 
PCA of binary data using Bernoulli likelihood has previously been studied 
by Collins, Dasgupta and Schapire (2002), Schein, Saul and Ungar (2003) 
and de Leeuw (2006), but none of these works considered sparse loading 
vectors. As we demonstrate using simulation and real data, sparsity can en- 
hance interpretation of results and improve the stability and accuracy of the 
extracted principal components. 

Other approaches of sparse PCA are not as easily extendible to binary 
data. Jolliffe, Trendafilov and Uddine (2003) modified the defining maximum 
variance problem of the standard PCA by applying an Li-norm constraint 
on the PC loading vectors to obtain PCA with sparse loadings. Its use 
of sample variance makes it unappealing for binary data. Zou, Hastie and 
Tibshirani (2006) rewrote PCA as a regression- type optimization problem 
and then applied the LASSO penalty [Tibshirani (1996)] to obtain sparse 
loadings. However, since the data appear both as regressors and responses 
in their regression- type problem, the connection of their approach to the 
penalized likelihood is not as natural as Shen and Huang (2008). 

The rest of this article is organized as follows. In Section 2 we introduce 
the optimization problem that yields the sparse logistic PCA and provides 
methods for tuning parameter selection. Section 3 applies the sparse logistic 
PCA to a single nucleotide polymorphism data set and compares it with 
the nonsparse version of logistic PCA. Section 4 presents a Majorization- 
Minimization algorithm for efficient computation of the sparse logistic PCA 
and Section 5 discusses how to handle missing data. Results of a simula- 
tion study are given in Section 6. Section 7 concludes the paper with some 
discussion. The Appendix contains proofs of theorems. 

2. Sparse logistic PCA with penalized likelihood. 

2.1. Penalized Bernoulli likelihood. Consider the n x d binary data ma- 
trix Y = {yij), each row of which represents a vector of observations from 
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binary variables. We assume that entries of Y are realizations of mutu- 
ally independent random variables and that yij follows the Bernoulli dis- 
tribution with success probability vTjj. Let 9ij = log{7rjj/(l — vTjj)} be the 
logit transformation of VTy. Define the inverse logit transformation Tr{9) = 
{1 + exp(-6l)}"i. Then the success probabilities can be represented using 
the canonical parameters as nij =7r{6ij). The individual data generating 
probability becomes 

with Qij = 2yij — 1 since vr(— 0) = 1 — it{9). This representation leads to the 
compact form of the log likelihood as 

n d 

(2.1) £ = 5^J]log^(g,,%). 

i=i j=i 

Note that the Bernoulli distributions are in the exponential family and 6ij 
are the corresponding canonical parameters. 

To build a probabilistic model for principal components analysis of binary 
data, the d-dimensional canonical parameter vectors 6i = [On, . . . , 9id)'^ are 
constrained to reside in a low dimensional manifold of M.'^ with the dimen- 
sionality k. (The choice of k will be discussed later in Section 2.3.) Specif- 
ically, we assume that, for some vectors fi, bi,...,bfc G W^, the vector of 
canonical parameters satisfies Oi = fi + anhi + • • ■ + ajfcbfc for i = 1, . . . , n. 
We call bi,...,bfc the principal component loading vectors and the coef- 
ficients aj = (oji, . . . , Ojfc)^ the principal component scores (PC scores) for 
the ith observation. Geometrically, the vectors of canonical parameters Oi 
are projected onto the /c-dimensional manifold which is the affine subspace 
spanned by k PC loading vectors and translated by the intercept vector fx. 
In matrix form, the canonical parameter matrix = (Oij) = {Oi, . . . , On)'^ is 
represented as 

(2.2) e = l„(8)Ai'^ + AB'^, 

where A = (ai, . . . ,a„)^ is the n x k principal component score matrix and 
B = (bi, . . . ,bfe) is the p x k principal component loading matrix. For iden- 
tifiability purpose, we require that A has orthonormal columns. 

We target a method that can produce a sparse loading matrix, a loading 
matrix with many zero elements. A sparse loading matrix implies variable 
selection in principal components analysis, since each principal component 
only involves those variables corresponding to the nonzero elements of the 
loading vector. We propose to perform variable selection using the penalized 
likelihood with a sparsity inducing penalty. Let bj denote the jth row of 
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B. Then (2.2) implies that Oij = + afhj where Hj is the jth element of 
fi. The log likelihood can be written as 

d n 

(2.3) £{fi, A,B) = Y,Y1 + afbj)}. 

j=i i=i 

If aj were observable, (2.3) is the log likelihood for d logistic regressions 

logitP{Yij = l) = ^ij + a[hj. 

This connection with logistic regression suggests use of the Li penalty to 
get a sparse loading matrix, as in the LASSO regression [Tibshirani (1996)]. 
Specifically, consider the penalty 

k d d 

(2.4) Pa(B)= J^Ai||bn|i = Ai J^|6,-i| + --- + Afc5^|6,fc|, 

1=1 j=i j=i 

where A; are regularization parameters whose selection will be discussed 
later. We obtain sparse principal components by maximizing the following 
penalized log likelihood: 

(2.5) /(/x,A,B)=£(/i,A,B)-7iP^(B), 

subject to the constraint that A has orthonormal columns. Note that B 
enters the likelihood together with A through AB-^ and so B can be arbi- 
trarily small by just increasing the magnitude of A and not changing the 
likelihood. The orthonormal constraint on A prevents elements of A be- 
coming arbitrarily large and thus validates our use of the Li penalty on 
B. 

The sparse principal components can be equivalently formulated as min- 
imizing the following criterion function: 

(2.6) S(/x,A,B) = -£(/2,A,B) + nPA(B), 

subject to the constraint that A has orthonormal columns. In (2.6) the 
negative log likelihood can be interpreted as a loss function and the Li 
penalties increase the loss for nonzero elements of B according to their 
magnitude. This penalized loss interpretation is also appealing in the sense 
that the independent Bernoulli trials assumption for obtaining the likelihood 
(2.3) need not be a realistic representation of the actual data generating 
process but rather a device for generating a suitable loss function. Since 
the Li penalties regularize the loss minimization, the sparse logistic PCA is 
sometimes also referred to as the regularized logistic PCA. We shall focus on 
the minimization problem (2.6) for the rest of the paper. A computational 
algorithm for solving the minimization problem is presented in Section 4. 
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Nonregularized PCA 




Regularized PCA 



Fig. 1. A simulated data set with n = 100, d — 200 and k = 1. Top, middle and bottom 
panels show respectively the true loadings, loadings from the nonregularized logistic PCA 
and from the regularized logistic PCA. The penalty parameter is selected using the BIC. 



The effectiveness of the proposed sparse logistic PCA is illustrated in 
Figure 1 using a rank-one model (i.e., k = 1). While the sparse logistic PCA 
can recover the original loading vector well, the nonregularized logistic PCA 
gives more noisy results. A systematic simulation study is reported in Section 
6. 

2.2. Choosing the penalty parameters. Although different penalty pa- 
rameters can be used for different PC loading vectors for maximal flexibility 
of the methodology, we consider using only a single penalty parameter A for 
all PC loadings. This simplification substantially reduces the computation 
time, especially when k is large. Note that a larger value of A will lead to 
a smaller number of nonzeros in the loading matrix B and reduced model 
complexity, but the reduced model complexity is usually associated with less 
good fit of the model. To compromise the goodness of fit and model com- 
plexity, for fixed k, we choose A by minimizing the following BIC criterion: 

(2.7) BIC(A) = -2^(/x, A, B) + logn x m(A), 

where m(A) is a measure of the degrees of freedom. Note that Zou, Hastie 
and Tibshirani (2007) showed that the number of nonzero coefficients is an 
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unbiased estimate of the degrees of freedom for the LASSO regression. The 
degrees of freedom m(A) used in (2.7) is defined as m{X) = d + nk + \B{X)\, 
where d is the length of the vector fi, nk is the total number of elements 
of A, and |^(A)| is the cardinality of the index set B{X) of the nonzero 
loadings in B when the penalty parameter is A. We use a grid search to find 
the optimal A that minimizes the BIG. 

2.3. Determining the dimensionality of the subspace. The BIG criterion 
defined in (2.7) can also be used to select a suitable "fc." A two-dimensional 
grid search can be used to find the minimizer of the BIG with respect to 
both k and A. To expedite computation, we implement the following strat- 
egy: First fix /c at a reasonable large value and select a good A, then using 
this A we refine the choice of k and, finally, we refine A with the refined k. 
When optimizing with respect to A, a coarse grid can be used in the first 
step and a finer grid in the second step. Our simulation study showed that 
this strategy works reasonably well (see Section 6.3). 

Remark 1. In classical multivariate analysis, the percentage of total 
variance explained by the principal components provides an intuitive mea- 
sure that can be used for subjectively choosing the appropriate number 
of principal components. Zou, Hastie and Tibshirani (2006) and Shen and 
Huang (2008) extended it to sparse PGA by modifying the definition of vari- 
ance explained by the PGs. Since there is no clear definition of total variance 
for the binary data, extension of the notion of "percentage of variance ex- 
plained" to logistic PGA is an interesting but unsolved problem. 

3. Application to single nucleotide polymorphism data. Association stud- 
ies based on high-throughput single nucleotide polymorphism (SNP) data 
[Brookes (1999); Kwok et al. (1996)] have become a popular way to detect 
genomic regions associated with human complex diseases. A SNP is a single 
base pair position in genomic DNA at which the sequence (alleles) varia- 
tion occurs between members of a species, wherein the least frequent allele 
has an abundance of 1% or greater. A crucial issue in association studies is 
population stratification detection [Hao et al. (2004)], which is to determine 
whether a population is homogeneous or has hidden structures within it. 
With the presence of population stratification, the naive case-control ap- 
proach not accounting for this factor would yield biased results [Ewens and 
Spielman (1995)] and, therefore, draw inaccurate scientific conclusions. See 
Liang and Kelemen (2008) for an extensive discussion of statistical methods 
and difficulties for SNP data analysis. 

The proposed sparse logistic PGA method can be used for population 
stratification detection. For the purpose of demonstration, we use the SNP 
data set available in the International HapMap project [The International 
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HapMap Consortium (2005)]. It consists of 3 different ethnic populations 
of 90 Caucasians (Utali residents witli ancestry from northern and west- 
ern Europe; CEO), 90 Africans (Yoruba in Ibadan, Nigeria; YRI) and 90 
Asians (45 Han Chinese in Beijing, China; CHB and 45 Japanese in Tokyo, 
Japan; JPT). Our task is to detect this three-subpopulation structure using 
the SNP data on the 270 subjects. At many SNP locations, heterozygosity 
distribution and allele frequency are known to be different among popula- 
tions and could confound the effect of the risk of disease. To account for 
this factor, Serre et al. (2008) selected 1536 SNPs with similar heterozygos- 
ity distribution and allele frequency. The locations of these SNPs cover all 
the chromosomes except for the sex-determining chromosome. Among these 
1536 SNPs, 1392 are shared by three ethnic groups, which are used in our 
analysis. We coded for the most prevalent homogeneous base pair (wild- 
type) and 1 for others (mutant), resulting in a 270 x 1392 binary matrix. 
This data matrix has 2.37% missing entries. 

We applied the sparse logistic PC A to this SNP data set to explore vari- 
ability among high dimensional SNP variables, using the computation al- 
gorithm given in Sections 4 and 5 below. The method described in Section 
2.3 was used for model selection. Specifically, we initially fixed the reduced 
dimension to k = 30 and chose the penalty parameter A among the rough 
grid of 0, 1.5"^*, 1.5~^^, . . . , 1.5~^^ using the BIC criterion defined in Section 
2.3. Given the selected A = 1.5~^®, the dimension k was refined by minimiz- 
ing the BIC, giving k = 10. Finally, with k = 10, we refined A by searching 
over the grid 0, 0.0005, 0.0010, 0.0015, . . . , 0.0100, resulting in A = 0.0015. As 
a comparison, we also applied the nonregularized logistic PCA to the data, 
which corresponds to A = in our general formulation of regularized logistic 
PCA. 

To examine which principal components represent the variability associ- 
ated with three racial groups, we used a F-test where scores for each fixed 
PC is regressed on the group dummy variables. For the sparse logistic PCA, 
only the first two PCs were highly significant with both p-values less than 
0.0001 and the remaining eight PCs were not significant with large p-values 
(0.7681, 0.9109, 0.4764, 0.5523, 0.3376, 0.5415, 0.4480, 0.6441 for the third to 
the tenth PCs respectively). This result suggests that the sparse logistic PCA 
can effectively compress the racial group information into two leading PCs. 
Similar compression was not achieved by the nonregularized logistic PCA; 
the i^-test was significant for all the first ten PCs with p-values <0.0001, 
<0.0001, 0.0002, 0.0001, <0.0001, <0.0001, <0.0001, 0.0028, <0.0001 and 
0.0299 respectively. 

Pairwise scatterplots were used to check clustering of subjects using the 
PC scores. Figure 2 shows the scatterplots of first 2 PC scores with and 
without regularization. The three ethnic groups are clearly separated by the 
regularized PCA but not by the nonregularized PCA. To verify that the 
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Fig. 2. The scatterplots of the first two PC scores from the nonregulanzed (left) and 
regularized logistic PCA. Circle, rectangle and triangle represent Caucasian, African and 
Asian population respectively. 



group separation obtained is not because of luck, we permuted observations 
for each SNP and applied the sparse logistic PCA to the permuted data set; 
no clear clustering showed up in the PC scores. 

The proposed sparse PCA method allows directly identifying the SNPs 
that contribute to the group separation. The selected model has 790 and 
658 nonzero loadings (representing the SNPs) respectively for the first 2 
PCs, among which 509 SNPs are shared. Therefore, 939 SNPs involved in 
the first 2 PC directions are claimed to be associated with the ethnic group 
effect. Our result suggests that the population stratification factor should 
be taken into consideration at these 939 SNP locations in the subsequent 
study of the association between SNPs and the disease phenotype to avoid 
biased conclusion. Although in light of our simulation results, some selected 
SNPs could be false positives, we believe that a large proportion of the 
selected SNPs are relevant in differentiation among the three racial groups, 
because the studied SNPs were delicately selected to represent the most 
genetic diversity of the whole genome [Serre et al. (2008)] and the genetic 
differentiation is the greatest when defined on a continental basis, which is 
the case for our comparison between Caucasian, Asian and African [Risch 
et al. (2002)]. 

We further compared the regularized and nonregularized logistic PCA by 
assessing the variability of the probability estimates using the parametric 
bootstrap. For each method, we generated 100 bootstrapped data sets of bi- 
nary matrices; each binary matrix has entries that are independently drawn 
from the Bernoulli distribution with success probability vrjj for the (i, j)th 
entry, where TTjj is the estimated probability. We then applied the method 
to these bootstrapped data sets to obtain 100 bootstrapped probabilities for 
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each (i, j) combination and to construct a 90% variability interval using the 
5% and 95% quantiles of the bootstrapped probabilities. These 90% vari- 
ability intervals were plotted against the ordered vTjj to form a variability 
envelop. The variability envelop for the regularized PC A is narrower than 
that for the nonregularized PCA, indicating that regularization indeed re- 
duces the variability of the probability estimates (Figure 3). 

Our working model for the logistic PCA specified by (2.1) and (2.2) as- 
sumes that, conditional on the principal component scores, the observations 
are independent. Since there exists spatial dependency among SNPs, one 
may have concerns about the validity of our analysis results if the depen- 
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Fig. 3. The SNP data: 90% bootstrap variability envelope (showed as lines) of the proba- 
bility estimates, using 100 randomly selected SNPs. Circles are the estimated probabilities 
TTij from the SNP data. Results are based on 100 bootstrap samples. 
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Fig. 4. Histograms of pairwise correlations of Pearson's residuals from nonregularized 
(left) and regularized (right) logistic PCA. 



SPARSE LOGISTIC PCA 



11 



dence is strong. In our data set, the 1536 SNPs were selected from the whole 
genome to capture most of the genetic diversity in population considering 
factors of physical distances, allele frequencies and linkage disequilibrium 
patterns. The selected SNPs are sufficiently well separated within each chro- 
mosome so that they can be representative of the whole genome [Serre et 
al. (2008)]. Therefore, we expect that the spatial dependency in this data 
set should not be too serious to invalidate our results. To address this issue 
empirically, we first computed Pearson's residuals after fitting the models for 
the nonregularized and regularized logistic PCA, then calculated pairwise 
correlations of these Pearson's residuals for all SNP pairs for each chromo- 
some. Figure 4 shows the histogram of the pairwise correlations for each 
model. For both models most pairwise correlations are close to zero, indi- 
cating that the SNPs are weakly correlated. We noticed that there exists a 
very small proportion of SNP pairs that are highly correlated. Examination 
of the physical locations revealed that those highly correlated SNP pairs 
consist of SNPs in close vicinity, indicating the imperfection of the initial 
SNP selection process. 

4. Computational algorithm. We develop a Majorization-Minimiza- 
tion (MM) algorithm for minimizing (2.6), which iteratively minimizes a 
suitably defined quadratic upper bound of (2.6). Instead of directly deal- 
ing with the nonquadratic log likelihood and the nondifferentiable sparsity 
inducing Li penalty, the MM algorithm sequentially optimizes a quadratic 
surrogate objective function. A function g{x\y) is said to majorize a function 
f{x) at y if 

g{x\y)>f{x) for ah X and g{y\y) = f{y). 

In the geometrical view the function surface g{x\y) lies above the function 
f{x) and is tangent to it at the point y so g{x\y) becomes an upper bound 
of f{x). To minimize f{x), the MM algorithm starts from an initial guess 
x^^^ of x, and iteratively minimizes g{x\x^"^^) until convergence, where x^"^^ 
is the estimate of x at the mth iteration. The MM algorithm decreases the 
objective function in each step and is guaranteed to converge to a local min- 
imum of f{x). When applying the MM algorithm, the majorizing function 
g{x\y) is chosen such that it is easier to minimize than the original objective 
function f{x). See Hunter and Lange (2004) for an introductory description 
of the MM algorithm. 

To find a suitable majorizing function of (2.6), we treat the log likelihood 
term and the penalty term separately. For the log likelihood term, note that, 
for a given point y, 

(4.1) -log7r(x) <-log7r(2/)-{l-7r(y)}(a;-y) + ^|^ {x - yf 

(4.2) < - log^(y) - {1 - ^(y)}(x - y) + ^(x - y)^, 
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and the equalities hold when x = y [Jaakkola and Jordan (2000); de Leeuw 
(2006)]. These inequalities provide quadratic upper bounds for the negative 
log inverse logit function at the tangent point y. We refer to the former 
bound as the tight bound, and the latter bound as the uniform bound since 
its curvature does not change with y. We pursue here the MM algorithm by 
using the uniform bound and leave the discussion of using the tight bound 
to the supplemental article [Lee, Huang and Hu (2010)]. Use of the tight 
bound usually leads to a smaller number of iterations of the algorithm but 
longer computation time because of the complexity involved in computing 
the bound. For the penalty term, the inequality 

x'^ + 1/^ 

(4.3) \x\<—-^, y/0, 

2|y| 

gives an upper bound for |x| and the equality holds when x = y [Hunter 
and Li (2005)]. Application of (4.2) and (4.3) yields a suitable majorizing 
function of (2.6) and thus an MM algorithm. 

Now we present details of the MM algorithm via the uniform bound. Let 
0{'") estimate of obtained in the mth step of the algorithm, with 

the entries o'^^^ = fi^"^^ + ajf^^'^hj^^ . By completing the square, the uniform 
bound (4.2) can be rewritten as 

(4.4) -log7r(x) <-log7r(y) + i[a::-y-4{l-7r(y)}]2. 

Substituting x and y with qijOij and qijO^J^^ respectively in (4.4) and noticing 
that Qij = ±1, we obtain 

(4.5) - logniq^.e,,) < -logniq^.e^f) + w^f (6^, - x^f^f, 
where w^'^^ = 1/8 and 

(4.6) ^!r^=C^+4<Z.,{l-vr(g.,el™))}. 

The superscript m of wj™'' and indicates the dependence on Q^"^\ 

Summing over all i, j of (4.5) and ignoring a constant term that does not 
depend on unknown parameters, we obtain the following quadratic upper 
bound of the negative log-likelihood: 

n d n d 

i=l j=l 2=1 j=l 

On the other hand, (4.3) implies that the penalty P\(B) has the following 
quadratic upper bound: 

j=l ^lOjl 1 i=l ^l°jA: I 
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Combining (4.7) and (4.8) yields the following quadratic upper bound (up 
to a constant) of the criterion function S'(^, A,B) defined in (2.6): 

ff(/x,A,B|/z(™),A("),B(™)) 

(4.9) 

n a 

i=i j=i 

where is a diagonal matrix with diagonal elements Xi / {2\b^j^^ \} for 

1 = 1,. ..,k. 

Theorem 4.1. (i) Up to a constant that depends on fj, (™), A^'") and 
B^'") but not on ^i, A and B, the function c/(/i, A, Bj/x^'"), A^™), B^™)) de- 
fined in (4.9) majorizes S{fi,A,B) at (/z(™), A('"),B(™)). 

(ii) Let (/i(™\ A(™\ B^"^)), m = 1,2, . . . , be a sequence obtained by iter- 
atively minimizing the majorizing function. Then 5(//("^), A^™), B(™)) de- 
creases as m gets larger and it converges to a local minimum of S{^i, A, B) 
as m goes to infinity. 

The majorizing function given in (4.9) is quadratic in each of jji, A and 
B when the other two are fixed and, thus, alternating minimization of (4.9) 
with respect to /x, A and B has closed-form solutions, which are given below. 
We now drop the superscript in x^J^^ for notational convenience. Recall that 

w']^^ = 1/8 is a constant. For fixed A and B, set x\- = Xij — a^bj, the 
optimal fij is given by 

n 1 " 

(4.10) Aj=argmin^(x^-^j)2 = -^2;t, j = l,...,d. 

i=i i=i 

This leads to a simple matrix formula /i = ^'K^'^ln, which is obtained by 
taking the column means of = (x|j ) . 

To update A and B for fixed n, set x*^ = Xij — fij or in matrix form, 
X* = {x*j) = X - 1^ (g) /x^. Denote the ith row vector of X* as x*^. For 
fixed /X and B, the ith row of A is updated by minimizing with respect to 
aj the sum of squares Yl'j=ii^ij ~ ^fbj)^ = (x* — Baj)^(x* - Ba^), which 
has a closed form solution 

(4.11) a, = (B^B)-iB^x*, i = l,...,n, 

or A = X*B(B^B)~^ in matrix form. The columns of updated A can be 
made orthonormal by using the QR decomposition. Denote the jth column 
vector of X* as x^. For fixed /x and A, the jth row of B is updated by 



14 



S. LEE, J. Z. HUANG AND J. HU 



solving the ridge regression problem that minimizes with respect to hj the 
penalized sum of squares 

i=i 1=1 ^i^j/ 1 

= ^(i* - Ab,-)^(i* - Ab,-) + nbjDAjb,, 
which has a closed form solution 

(4.12) bj = (A^A + 8nDAj)"^A^i*, j = l,...,d. 

Since, during the iteration, A is made orthonormal, A-^A becomes the iden- 
tity matrix of size k. Therefore, since the matrices to be inverted are diagonal 
matrices, hj can be obtained by component-wise shrinkage 

bji — — -r^ Xj, I — 1, . . . ,k,j — 1, . . . ,d, 

\b\i '\ +4nXi 

where a^ is the Ith. column of A. 

The MM algorithm will alternate between (4.10), (4.11) and (4.12) until 
convergence. The details are summarized in Algorithm 1. In this algorithm, 
k, the number of columns of A and B, should be specified in advance. Dif- 
ferent from the sequential extraction approach of Shen and Huang (2008), 
the matrices A and B obtained after applying Algorithm 1 depend on the 
value of k, but the results are reasonably stable when k is large enough. See 
Section 2.3 for discussion on choice of k. We use random initial values for 
fi, A and B. As with any nonlinear optimization algorithms, our algorithm 
is not guaranteed to converge to a global minimum. We can follow the com- 
mon practice to random start the algorithm several times and find the best 
solution. Our experience is that the algorithm with different initial values 
usually converges to the same solution (within the precision specified by the 
convergence criterion) . 

Algorithm 1 (Sparse logistic PC A algorithm I). 

1. Initialize with /^(i) = . . A^^) = {aS^\. . .,a'h^f and B^^) = 
(bS'\...,b^'))^. Setm = l. 

2. Compute x^f^ using (4.6) and set X(™) = (xj]"^). 

3. Set X(-)t = {x[f^) with x[f^ = x\f-a^^'^h'f\Vpdate fj, using = 

n "■ 

4. Set X(™+1)* = X^'") - 1„ O 
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5. Update A by A(™+i) = x('"+i)*B("^) (B('")^B('^))-i. Compute the QR 
decomposition A^*""^^) = QR and then replace A^™''"^^ by Q. 

6. Set C(-+i) = (47+')) = x(™+i)*^A(-+i). Update B by B(™+i) = {b^^^'^] 
where 

7. Repeat steps 2 through 6 with m replaced by m + 1 until convergence. 

Remark 2. The orthogonalization in step 5 of Algorithm 1 does not 
change the descent property of the MM algorithm. Let ^("^+^) be the op- 
timizer before orthogonalization. Then 5(^("'+^), S("^)) < 5(A(™),5(™)), 
where, for simplicity, // is omitted from the objective function S. Let = 
^(m+i) be the QR decomposition of and let S^™) = B^'^'iR^. Then 

^{m+l)^(m)T ^ ^^m+l)^(m)T ^^^^^ 5(^(™+l) , S^™) ) = 5(^(™+l) , S^*") ) . 

Consequently, 5(A("'+i),5(™)) < 5(A('"),5(™)). 

5. Handling missing data. Missing data are commonly encountered in 
real applications. In this section we extend our sparse logistic PCA method 
to cases when missing data are present. 

Let M = {{i,j)\yij is not observed} denote the index set for missing val- 
ues. The sparse logistic PCA minimizes the following criterion function: 

(5.1) r(/x, A,B) = -£„,,(/!, A, B) + nPA(B), 
where 

(5.2) 46,(/x, A,B) = ^^log7r{gij(/ij +afbj)} 

can be interpreted as the observed data log likelihood for model (2.2). Similar 
to the nonmissing data case, direct minimization of (5.1) is not straightfor- 
ward because the log likelihood term is not quadratic and the penalty term 
is nondifferentiable. Direct minimization of (5.1) is also complicated by the 
fact that the summation in the definition of the observed data log likelihood 
is not over a rectangular region. Again, we develop an iterative MM algo- 
rithm to solve the optimization problem. The strategy is to fill in the missing 
data with the fitted values based on the current parameter estimates, then 
proceed with the algorithm that assumes complete data, and iterate until 
convergence. 

Define the working variables 



(5.3) 
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where x-^ is defined in (4.6). Let 

/i(/x,A,B|/i("),A("),B("^)) 

(5.4) 

n d 

i=l j=l 

where d1^-^ are diagonal matrices with diagonal elements A//{2|6j™^|} for 
I = 1, . . . ,k. The following result extends Theorem 4.1 to the missing data 
case. The proof is given in the Appendix. 

Theorem 5.1. (i) Up to a constant that depends on A*^'"^ and 

B("^) hut not on fi, A andB, the function /i(/i, A,B|/x(™), A(™),B(™)) de- 
fined in (54) majorizes r(/i, A,B) at (/x("), A^™), B(™)). 

(ii) Let (/x^™'), A^'^^jB^™)), m = l,2,..., be a sequence obtained by iter- 
atively minimizing the majorizing function. Then T{fi^"^\ A^'^\'B^'^^) de- 
creases as m gets larger and it converges to a local minimum o/r(/i, A,B) 
as m goes to infinity. 

Note that the majorizing functions given in (5.4) have the same form as 
those given in (4.9) except that x-™^ in (4.9) is changed to z]-™^ in (5.4). Thus, 
the computation algorithm developed in Section 4 is readily applicable in the 
missing data case with a simple replacement of x\^^ by -Zj-"^^- The working 

variable z^"^^ in (5.4) is easily understood: It is the same as the nonmissing 
data case if yij is observable; otherwise, it is an imputed 9ij value based on 
the reduced rank model (2.2) and the current guess of fi, A and B. 

6. Simulation study. In this section we demonstrate our sparse logistic 
PC A method using a simulation study. The method worked well in various 
settings that we tested, but here we only report results in a challenging case 
that the number of variables d is bigger than the sample size n. 

6.1. The signal-to-noise ratio. To facilitate setting up simulation studies, 
we introduce a notion of signal-to-noise ratio for logistic PCA. In our logistic 
PCA model, the entries of the n x d data matrix are independent Bernoulli 
random variables with success probability vTjj = {1 + exp(— 6'jj)}~^ for the 
(i,j)th cell. The matrix of canonical parameters = {9ij) has a reduced 
rank representation = 1® fJ-^ + AB'^, where A is a n x /c matrix of 
PC scores and B is a sparse d x k PC loading matrix. In our simulation 
study, elements of the Ith column of A are independent draws from a zero- 
mean Gaussian distribution with variance a^,, 1 < I < k. The variance cr^. 
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measures the signal level of the /th PC. We set up the PC variances relative 
to a suitably defined baseline noise level. 

We define a baseline noise level for fixed n, d and k as follows. First 
we create a binary data matrix by generating n x d independent binary 
variables from Bernoulli distribution with the success probability 1/2. These 
binary variables are understood to come from the pure noise since they are 
generated without having any structure on the success probabilities. Then, 
we conduct a /c-component logistic PCA without regularization and compute 
the average of the sample variances of the obtained k PC scores, which is 
denoted as o"^. We repeat the above process of generating "pure noise" binary 
data matrices a large number of times (e.g., 100) and take the mean of o"^ 
computed from these matrices as the baseline noise level. 

With the notion of baseline noise level, we define the signal-to-noise ratio 
(SNR) for a PC as 

,^ , , variance of PC scores 

6.1 SNR=-— — -. 

baseline noise level 

In our simulation study we first compute the baseline noise level for a given 
combination of n, d and k, then use the above formula to specify the vari- 
ances of PC scores based on the fixed values of SNR. 



6.2. Simulation setup. We set the intrinsic dimension to be k = 2 and 
the number of rows of the data matrix to be n = 100. We varied the number 
of variables d and the signal-to-noise ratio SNR. We considered three choices 
oi d: d = 200, d = 500 and d = 1000. The scores of the lih. PC were randomly 
drawn from the A^(0, cj^;) distribution with fx^^ = SNR; • (baseline noise level), 
where SNR/ is the SNR for the lih PC. We considered two settings of SNR: 
(3,2) and (5,3). For example, when the SNR is (3,2), the variance of the 
first PC is 3 times the baseline noise level and the variance of the second 
PC is 2 times the baseline noise level. We construct two sparse PC loading 
vectors as follows: Let hji and hj2 denote correspondingly the components of 
the first and the second PC loading vectors. We let hji = \ for j = 1, . . . , 20, 
hj2 = 1 for j = 21, ... ,40 and the rest of hji are all taken to be 0. The mean 
vector was set to be a vector of zeros. 

6.3. Simulation results. Logistic PCA with and without sparsity induc- 
ing regularization was conducted on 100 simulated data sets for each setting. 
When applying the sparse logistic PCA algorithm, three choice of k were 
considered: k is fixed at the true value (/c = 2), at a moderately large value 
{k = 30), and selected using the BIC. The penalty parameter was selected 
using the method described in Section 2.2. 

To measure the closeness of the estimated PC loading matrix B and the 
true loading matrix B, we use the principal angle between spaces spanned by 
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Table 1 

The results of logistic PC A with and without sparsity inducing regularization, based on 
100 simulated data sets for each setting. The reported values are the mean (standard 
error) of the principal angle (°) between the estimated and the true PC loading matrices 



a 




I.. 

K = 


= 2 


fe = 


30 


aelected fc 


200 


SNR= (3,2) 
Nonregularized 
Regularized 


12.532 
5.860 


(0.115) 
(0.123) 


35.725 
10.125 


(0.177) 
(0.324) 


5.816 (0.125) 




SNR= (5,3) 
Nonregularized 
Regularized 


11.913 
5.803 


(0.122) 
(0.128) 


36.350 
9.843 


(0.189) 
(0.321) 


5.769 (0.127) 


500 


SNR= (3,2) 
Nonregularized 
Regularized 


10.890 
4.731 


(0.095) 
(0.115) 


31.884 
9.413 


(0.188) 
(0.282) 


4.690 (0.101) 




SNR= (5,3) 
Nonregularized 
Regularized 


10.166 
4.729 


(0.095) 
(0.121) 


31.941 
9.242 


(0.193) 
(0.252) 


4.544 (0.119) 


1000 


SNR= (3,2) 
Nonregularized 
Regularized 


12.018 
7.015 


(0.167) 
(0.486) 


36.040 
11.807 


(0.181) 
(0.433) 


4.534 (0.141) 




SNR= (5,3) 
Nonregularized 
Regularized 


11.370 
6.767 


(0.156) 
(0.474) 


36.144 
10.825 


(0.180) 
(0.475) 


4.196 (0.127) 



B and B . The principal angle measures the maximum angle between any two 
vectors on the spaces generated by the columns of B and B. More precisely, 
it is defined by cos~^(p) x ISO/tt, where p is the minimum eigenvalue of the 
matrix Q^Qb, where Qs and Qb are orthogonal basis matrices obtained 

by the QR decomposition of matrices B and B, respectively [Golub and van 
Loan (1996)]. 

The mean and standard deviation of principal angles for logistic PCA with 
and without regularization are presented in Table 1. Since smaller princi- 
pal angles indicate better estimates of the PC loading matrix, the sparsity 
inducing regularization has a clear benefit — it can substantially reduce the 
principal angles. The benefit is even more profound when the number of PCs 
used in the program [k = 30) is larger than the true number that was used 
to generate the data (fc = 2). The performance of sparse logistic PCA with 
selected k is similar to that when k is fixed at the true value. Frequencies 
of the selected k from 100 simulation data sets in each settings of Table 1 
are shown in Table 2. When d = 200, the BIC finds well the true k = 2 but, 
as d gets larger, there is a trend that a slightly larger k is selected. The 
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Table 2 

Frequencies of the selected k using the BIC 



Selected k 



d 


SNR 


1 


2 


3 


4 


5 


6 


7 


200 


(3,2) 





95 


5 
















(5,3) 





96 


4 














500 


(3,2) 


1 


58 


37 


4 













(5,3) 





60 


36 


3 


1 








1000 


(3,2) 


3 


34 


36 


15 


10 


1 


1 




(5,3) 


2 


31 


47 


15 


4 


1 






performance of using BIC to select k is considered as quite good, given that 
the sample size is only 100. 

A useful feature of the sparse logistic PCA is its ability to select rele- 
vant variables when estimating the PC loading vectors. A zero loading of a 
variable on a PC means that the corresponding variable is not used when 
forming that PC, and a nonzero loading indicates a useful variable. Our 
experience with simulated data shows that nonzero loadings can almost al- 
ways be identified by the method, but some identified nonzero loadings may 
correspond to irrelevant variables, cases of false positives. Table 3 presents 
the percentages of false positives for various settings reported in Table 1. 
When d is 500 or 1000, the percentages of false positives are low, all below 
20%. But when d is 200, the percentages of false positives are between 40% 
and 50%, suggesting big room for improvement in variable selection. 

7. Discussion and extension. In this paper we propose a sparse PCA 
method for analyzing binary data by maximizing a penalized Bernoulli like- 
lihood. The sparsity inducing Li penalty is used to acquire simple principal 

Table 3 

The results of logistic PCA with sparsity inducing regulanzation, based on 100 simulated 
data sets for each setting in Table 1. The reported values are the mean (standard error) 
of the percentages of false positives. The description of results is in the text 



d 


SNR 


k = 2 


fc = 30 


Selected fc 


200 


(3,2) 


45.05(1.54) 


41.51 (1.39) 


44.94 (1.51) 




(5,3) 


48.16 (1.63) 


40.53 (1.36) 


48.26 (1.63) 


500 


(3,2) 


14.83 (0.74) 


18.91 (0.51) 


16.70 (0.72) 




(5,3) 


16.06 (0.68) 


18.78 (0.42) 


16.93 (0.68) 


1000 


(3,2) 


10.87 (0.75) 


12.80 (0.73) 


10.13 (0.60) 




(5,3) 


10.89 (0.70) 


12.86 (0.73) 


9.26 (0.50) 
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components for the sake of easy interpretation and stable estimation. The 
MM algorithm developed for implementation of our method provides a uni- 
fied solution for dealing with (i) the nonquadratic likelihood, (ii) the nondif- 
ferentiable penalty function, and (iii) presence of missing data. Although the 
theoretical derivation is not straightforward, the steps of the algorithm are 
very simple — they are (weighted) penalized least squares with closed-form 
expressions. 

We have focused on the logit link so far, but other link functions can 
also be used. In particular, a slight modification of the proposed method 
can handle the probit link, where the success probabilities 6ij = ^"^(vrjj) 
with <!>(•) being the c.d.f. of the standard Gaussian distribution. The log 
likelihood function (2.3) of the reduced rank model is changed to 

d n 

(7.1) A,B) =Y,Y1 + ^^b^)}- 

j=i i=i 

Instead of using the majorization in (4.2), we apply the following upper 
bound to majorize the negative log likelihood: 

(7.2) - log $(x) < - log$(y) - ^{x -y) + ^{x - y)\ 

where </>(•) is the Gaussian density [Bohning (1999); de Leeuw (2006)]. Al- 
gorithm 1 still applies with appropriate changes to the definitions of the 
weights w^ij^ and the working variables x^^'^ . 

Our method can also be extended in a straightforward way to handle com- 
posite data which consists of both binary and continuous variables. While 
the binary variables are modeled with Bernoulli distributions, the continu- 
ous variables can be modeled with Gaussian distributions. Including some 
continuous variables corresponds to adding some negative Gaussian log like- 
lihood terms to the log likelihood expression (2.3). Since the Gaussian log 
likelihood is quadratic, it blends in easily with the quadratic majorization 
used for the logistic PGA. Specifically, if the j'th variable is of a continu- 
ous type, we assume yij ^ N{9ij,a'^) with 9ij satisfying (2.2), and simply 
let x[j^^ = yij and wj"^^ = 1/c"^ when forming the majorizing function (4.9). 
The residual variance cj^ of fitting the continuous variables can be estimated 
using the residual sum of squares. Taking into account the fact that differ- 
ent weighting schemes are used for the binary variables and the continuous 
variables in the majorizing function, a slight modification of Algorithm 2 
presented in the supplemental article [Lee, Huang and Hu (2010)] can be 
used for computation. 
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A.l. Proof of Theorem 4.1. We prove the results for both the tight and 
the uniform bound case. Apphcations of (4.1) and (4.2) yield the following 
majorizing functions of the negative log likelihood — ^(/x, A,B): 



EE 

i=i j=i 



-log7r(g,,el"^)) - g,,{l - vr(g,,^l"^0}(^ - 01 



+ 



aim). 2 
ij J 



for the tight bound, and 

d 



log7r(gij6'. 



«J ij 



EE 

i=i j=i 

for the uniform bound. Note that 



ij ' 



{2.(,.,^j-))-i}/{%,^ir^}={2-(ejr^) 



i}/{4^Jr^} 



for qij = ±1. By completing the squares and using the definitions of x, 
and w^ij^\ these majorizing functions can be rewritten as 

-^V^A.Bl/x^^^A^^^B^'")) 



(m) 



-^(0(-)) - 2 5: 5;{i - vr(g,-^j;))} V 5: 5: 



11 / 



i=l j=l 



i=l j=l 



On the other hand, application of (4.3) yields the following majorizing func- 
tion of Pa(B): 



Pa(B|B(")) = Ai^ 



d 1,2 I ^(m)2 



"j^ 



216 



(m) I 



+ • • • + Afc J]] 



jk 



i=i 



216' 



(m) I 



i=i i=i 

Since the majorization relation between functions is closed under the forma- 
tion of sums, -£ + nPA(B|B(™)) majorizes 5(/2, A, B) at (/x^'"), A(™),B("^)). 
Noticing that -i + nPx(B\B("^^) equals ^(/x, A,B|/2('"), A('^),B(™)) up to 
a constant independent of (/x, A,B), we complete the proof of part (i). Part 
(ii) of the theorem follows from the general property of the MM algorithm 
[Hunter and Lange (2004)]. □ 
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A. 2. Proof of Theorem 5.1. Note that the objective function to be min- 
imized is the summation of two terms — the log hkehhood term and the 
penalty term. Because the majorization property is closed under function 
summation, we deal with the two terms separately. We can find a majoriza- 
tion function of the penalty term as in Theorem 4.1. To find a majorization 
function of the log likelihood term, we apply the argument in the stan- 
dard EM algorithm for handling missing data [Dempster, Laird and Rubin 
(1977)]. The complete data log likelihood is 

£com(Ai, A,B) = ^^log7r(gij6'jj) -F^^log7r(gij%). 

Its conditional expectation given the observed data and the current guess of 
the parameter values is 

Q(/x,A,B|/x("),A("^),b(™)) 
(A.l) =^^log^(g,,e,,) 

where Yq denotes the observed data. By the standard EM theory, 

- iohs (/X, A, B) ^ -Q(/., A, B l/x^-) , A(-) , B^ ) _ (/^M , a(-) , B^™) ) 
(A.2) 

+ Q(/i(™) , A^'") , b("^) |/^("^) , A^'") , b("^) ) 

majorizes -£o6,(/x. A, B) at (/x^™), A("^),B(™)), that is, -46,(Ai,A,B) > 
-lobs (a*, A, B) , and the equality holds when (/x. A, B)_^ = (/x(") , A^™) , B(") ) . 

Now we find a quadratic majorizing function of — ^of,s(A*; A, B), which in 
turn majorizes — ^o6s(/^) A,B) because of the transitivity of the majoriza- 
tion relation. We need only to find a quadratic majorization function of 
— (5(/^, A,B|/x(™), A('"),B(™)), since it is the only term in the definition 
(A.2) of — ^o6s(/^) A, B) that depends on the unknown parameters. Accord- 
ing to (A.l), — (5(/x, A, B|/x(™) , A^™), B(™)) can be decomposed into two 
terms, one corresponding to observed data, the other corresponding to the 
missing data. The former term can been treated as in the proof of Theorem 

4.1. When (i,j) ^M, — logiT{qij9ij) is majorized by wlj^\9ij — xfj^^)"^, up to 
a constant. To treat the latter term, note that, when (zAf, 

[log ^(g,,0,, ) I Yo, /^(''") , A(-) , B(-)] 

= ni9lf)log7.{9,j) + {1 - vr(eJ7))} log{l - vr(^,,)} 
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using the fact that the missing data are independent of the observed data, 
and that 1 — 7r{9) = tt{—9). Then, by applying inequahties (4.1) and (4.2) 
and using the definition of w^ij'^ , we obtain that 

qij=±l 

<c^+«.;7){(0.,-0^7))}^ 

where Cm is a constant independent of /x, A and B. Combining the above 
results, we see that — (5(/^, A,B|/z(™), A(™\b(™)) is up to a constant ma- 
jorized by X^j^- w|"^{(6'jj - zI]"-*)}^, where 4™-* equals x'f^'' if (i, j) ^ and 

^(m) .|. ^^^j^ G A/". The proof of part (i) is thus complete. Part (ii) of the 
theorem follows from the general result of the MM algorithm. □ 

Acknowledgments. We would like to thank Editor Michael Stein, an As- 
sociate Editor and two referees for helpful comments. We would also like to 
thank Lan Zhou for help in improving the writing of the paper. 

SUPPLEMENTARY MATERIAL 

The MM algorithm for sparse logistic PCA using the tight bound (DOL 
10.1214/10-AOAS327SUPP; .pdf). We develop the MM algorithm for sparse 
logistic PCA using the tight majorizing bound. Comparison of the developed 
algorithm with the MM algorithm using the uniform bound in terms of 
computing time is also presented. 

REFERENCES 

BOHNING, D. (1999). The lower bound method in probit regression. Comput. Statist. Data 
Anal. 30 13-17. MR1681451 

Brookes, A. J. (1999). Review: The essence of SNPs. Gene 234 177-186. 

Collins, M., Dasgupta, S. and Schapire, R. E. (2002). A generalization of principal 
component analysis to the exponential family. In Advanced in Neural Information Pro- 
cessing System (T. G. Dietterich, S. Becker and Z. Ghahramani, eds.) 14 617-642. MIT 
Press, Cambridge, MA. 

DE Leeuw, J. (2006). Principal component analysis of binary data by iterated singular 
value decomposition. Comput. Statist. Data Anal. 50 21-39. MR2196220 

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from 
incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 
39 1-38. MR0501537 

Ewens, W. .J. and Spielman, R. S. (1995). The transmission/disequilibrium test: History, 
subdivision, and admixture. The American Journal of Human Genetics 57 455-464. 



24 



S. LEE, J. Z. HUANG AND J. HU 



GOLUB, G. and van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins 
Univ. Press, Baltimore, MD. MR1417720 

Hao, K., Li, C, Rosenow, C. and Wong, W. H. (2004). Detect and adjust for popula- 
tion stratification in population-based association study using genomic control markers: 
An application of Affymetrix Genechip® Human Mapping lOK array. European Journal 
of Human Genetics 12 1001-1006. 

HOTELLING, H. (1933). Analysis of a complex of statistical variables into principal com- 
ponents. Journal of Educational Psychology 24 417-441. 

Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist. 58 
30-37. MR2055509 

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 
33 1617-1642. MR2166557 

Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via varia- 
tional methods. Statist. Comput. 10 25-37. 

JOLLIFFE, 1. T. (2002). Principal Component Analysis, 2nd ed. Springer, New York. 
MR2036084 

JOLLIFFE, I. T., Trendafilov, M. and Uddine, M. (2003). A modified principal 
component technique based on the LASSO. J. Comput. Graph. Statist. 12 531-547. 
MR2002634 

KwOK, P. Y., Deng, Q., Zakeri, H., Taylor, S. L. and Nickerson, D. A. (1996). 

Increasing the information content of STS-based genome maps: Identifying polymor- 
phisms in mapped STSs. Genomics 31 123-126. 
Lange, K., Hunter, D. R. and Yang, 1. (2000). Optimization transfer using surrogate 

objective functions (with discussion). J. Comput. Graph. Statist. 9 1-20. MR1819865 
Lee, S., Huang, J. Z. and Hu, J. (2010). The MM algorithm for sparse logistic PGA 

using the tight bound: A supplementary note to "Sparse logistic principal components 

analysis for binary data." DOI: 10.1214/10- AOAS327SUPP. 
Liang, Y. and Kelemen, A. (2008). Statistical advances and challenges for analyzing 

correlated high dimensional SNP data in genomic study for complex diseases. Stat. 

Surv. 2 43-60. MR2520980 
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The 

London, Edinburgh and Dublin Pholosophical Magazine and Journal of Science, Sixth 

Series 2 559-572. 

RiSCH, N., Burchard, E., Ziv, E. and Tang, H. (2002). Gategorization of humans 
in biomedical research: Genes, race and disease. Genome Biology 3 comment 2007.1- 
2007.12. 

SCHEIN, A. I., Saul, L. K. and Ungar, L. H. (2003). A generalized linear model for 
principal component analysis of binary data. In Proceedings of the Ninth International 
Workshop on Artificial Intelligence and Statistics (C. M. Bishop and B. J. Frey, eds.) 
14-21. Key West, FL. 

Serre, D., Montpetit, A., Pare, G., Engert, J. G., Yusuf, S., Keavney, B., Hud- 
son, K. J. and Anand, S. (2008). Correction of population stratification in large 
multi-ethnic association studies. PLoS ONE 2 el382. 

Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized 
low rank matrix approximation. J. Multivariate Anal. 99 1015-1034. MR2419336 

The International HapMap Consortium (2005). A haplotype map of the human 
genome. Nature 437 1299-1320. 

Tibshirani, R. J. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. 
Soc. Ser. B 58 267-288. MR1379242 



SPARSE LOGISTIC PCA 



25 



Zou, H., Hastie, T. J. and Tibshirani, R. J. (2006). Sparse principal component anal- 
ysis. J. Comput. Graph. Statist. 15 265-286. MR2252527 

Zou, H., Hastie, T. J. and Tibshirani, R. J. (2007). On the "Degrees of Freedom" of 
the LASSO. Ann. Statist. 35 2173-2192. MR2363967 



,J. Hu 

Department of Biostatistics 

Division of Quantitative Sciences 

University of Texas M. D. Anderson Cancer Center 

Houston, Texas 77030-4009 

USA 

E-MAIL: jhu@mdanderson.org 



S. Lee 

Department of Statistics 

Hankuk University of Foreign Studies 

YONGIN-SI, GyEONGGI-DO 

449-771 Korea 
E-MAIL: lees@ufs.ac.kr 



,J. Z. Huang 

Department of Statistics 
Texas A&M University 
College Station, Texas 77843-3143 
USA 

E-MAIL: jianhua@stat.tamu.edu 



